Implementing ak._v2.to_parquet. #1440

jpivarski · 2022-04-22T20:49:42Z

No description provided.

codecov · 2022-04-22T21:17:53Z

Codecov Report

Merging #1440 (2ae5205) into main (edfce38) will increase coverage by 0.08%.
The diff coverage is 63.40%.

Impacted Files	Coverage Δ
src/awkward/_v2/_connect/cuda/__init__.py	`0.00% <0.00%> (ø)`
src/awkward/_v2/_connect/jax/__init__.py	`0.00% <0.00%> (ø)`
src/awkward/_v2/contents/recordarray.py	`82.26% <ø> (+0.76%)`	⬆️
src/awkward/_v2/contents/unionarray.py	`86.50% <ø> (ø)`
src/awkward/_v2/operations/convert/ak_from_cupy.py	`50.00% <0.00%> (+23.33%)`	⬆️
src/awkward/_v2/operations/convert/ak_from_iter.py	`93.75% <ø> (ø)`
src/awkward/_v2/operations/convert/ak_from_jax.py	`50.00% <0.00%> (-25.00%)`	⬇️
src/awkward/_v2/operations/convert/ak_to_cupy.py	`33.33% <0.00%> (+23.95%)`	⬆️
src/awkward/_v2/operations/convert/ak_to_jax.py	`33.33% <0.00%> (-41.67%)`	⬇️
src/awkward/_v2/operations/describe/ak_backend.py	`10.00% <0.00%> (-2.50%)`	⬇️
... and 92 more

…ng 'OSError: Data size too small for number of values (corrupted file?)' on read-back.

for more information, see https://pre-commit.ci

martindurant · 2022-04-25T21:32:04Z

Please see here for the list of options we thought would be important when we talked this through.

…ikit-hep/awkward-1.0 into jpivarski/start-v2-to_parquet

jpivarski · 2022-04-25T21:56:47Z

compression per data type or per leaf column ("path.to.leaf": "zstd" format)

Per leaf column is implemented: all ParquetWriter arguments that accept a dict or list can be selected via a dict of Awkward column selectors. Selectors are strings or iterables of strings that get passed to Form.select_columns (which slice through Forms to make new Forms):

>>> array = ak._v2.Array([[{"x": 1.1, "y": [1], "z": "one"}, {"x": 2.2, "y": [1, 2], "z": "two"}], [], [{"x": 3.3, "y": [1, 2, 3], "z": "three"}]])
>>> print(array.layout.form.select_columns("x"))
{
    "class": "ListOffsetArray",
    "offsets": "i64",
    "content": {
        "class": "RecordArray",
        "contents": {
            "x": "float64"
        }
    }
}
>>> print(array.layout.form.select_columns("y"))
{
    "class": "ListOffsetArray",
    "offsets": "i64",
    "content": {
        "class": "RecordArray",
        "contents": {
            "y": {
                "class": "ListOffsetArray",
                "offsets": "i64",
                "content": "int64"
            }
        }
    }
}
>>> print(array.layout.form.select_columns("z"))
{
    "class": "ListOffsetArray",
    "offsets": "i64",
    "content": {
        "class": "RecordArray",
        "contents": {
            "z": {
                "class": "ListOffsetArray",
                "offsets": "i64",
                "content": {
                    "class": "NumpyArray",
                    "primitive": "uint8",
                    "parameters": {
                        "__array__": "char"
                    }
                },
                "parameters": {
                    "__array__": "string"
                }
            }
        }
    }
}
>>> print(array.layout.form.select_columns(["x", "y"]))
{
    "class": "ListOffsetArray",
    "offsets": "i64",
    "content": {
        "class": "RecordArray",
        "contents": {
            "x": "float64",
            "y": {
                "class": "ListOffsetArray",
                "offsets": "i64",
                "content": "int64"
            }
        }
    }
}

They're wildcard-friendly and don't have the "list.item/list.element" baggage.

Once sliced, column_types gives you Parquet-relevant types for the columns:

>>> array.layout.form.column_types()
(dtype('float64'), dtype('int64'), 'string')

Again, it ignores whatever lists or option-types it encountered on the way down to the leaves, and this "string" means string or bytestring, distinct from any dtypes.

They were to be used here:

https://github.com/scikit-hep/awkward-1.0/blob/6286e29ec769e95ef882f7f4a79eb43212bf0b53/src/awkward/_v2/operations/convert/ak_to_parquet.py#L123-L145

But Arrow is saying that nested column buffers made this way are too short, that the file is possibly corrupted. If I'm not misunderstanding something, this looks like a bug in pyarrow.

byte stream split for floats if compression is not None or lzma

See above.

partitioning

If the data is an iterator without len, then it will iterate over everything in it and make one row-group per item.

https://github.com/scikit-hep/awkward-1.0/blob/6286e29ec769e95ef882f7f4a79eb43212bf0b53/src/awkward/_v2/operations/convert/ak_to_parquet.py#L41-L50

parquet 2 for full set of time and int types
v2 data page (for possible later fastparquet implementation)

https://github.com/scikit-hep/awkward-1.0/blob/6286e29ec769e95ef882f7f4a79eb43212bf0b53/src/awkward/_v2/operations/convert/ak_to_parquet.py#L25-L26

I haven't tried it out yet.

dict encoding always off

That would be use_dictionary=False (though I think you'd want it on some string-valued fields).

https://github.com/scikit-hep/awkward-1.0/blob/6286e29ec769e95ef882f7f4a79eb43212bf0b53/src/awkward/_v2/operations/convert/ak_to_parquet.py#L172

But the apparent bug I mentioned above is making me hesitate about that.

jpivarski · 2022-04-26T19:57:08Z

This is pretty much done. I think the options might change; in particular, I think it would be much better to be writing compliant list types, but there's an issue with that.

martindurant · 2022-04-26T20:09:38Z

If the data is an iterator without len, then it will iterate over everything in it and make one row-group per item.

I meant directory name partitioning, ie., following group-by.

Per leaf column compression is implemented

(I forgot to mention that we discussed what decent defaults might be, per data type)

I know nothing about the possible pyarrow bug.

jpivarski · 2022-04-28T15:11:19Z

The variable named row_group_number is not strictly right because the writer might write more than one row group in a call. It should be named iteration_number.

Implementing ak._v2.to_parquet.

925c2af

jpivarski and others added 12 commits April 25, 2022 07:47

Fix for pylint.

6677863

Implemented all argument mappings.

86d18f8

use_dictionary and use_byte_stream_split for nested columns are causi…

6e55e06

…ng 'OSError: Data size too small for number of values (corrupted file?)' on read-back.

Oops, row_group_size.

9acf5e3

Start writing tests.

0265eb2

pyarrow 7 requires Python > 3.6.

bad04ee

requirements.txt syntax.

64a9c79

Also need fsspec.

57dcce0

[pre-commit.ci] auto fixes from pre-commit.com hooks

4808df4

for more information, see https://pre-commit.ci

Better repr parameters.

b2ef3b3

Better balance between value and type in repr.

00b4fa6

Formatting.

520aa04

jpivarski added 2 commits April 25, 2022 16:36

All tests pass for ak._v2.to_parquet.

badc86a

Merge branch 'jpivarski/start-v2-to_parquet' of https://github.com/sc…

6286e29

…ikit-hep/awkward-1.0 into jpivarski/start-v2-to_parquet

Done enough to merge.

2ae5205

jpivarski marked this pull request as ready for review April 26, 2022 19:57

jpivarski enabled auto-merge (squash) April 26, 2022 19:57

jpivarski merged commit c1ffcce into main Apr 26, 2022

jpivarski deleted the jpivarski/start-v2-to_parquet branch April 26, 2022 20:32

jpivarski mentioned this pull request Apr 28, 2022

ak._v2.to_parquet: row_group_number is not strictly right #1450

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implementing ak._v2.to_parquet. #1440

Implementing ak._v2.to_parquet. #1440

jpivarski commented Apr 22, 2022

codecov bot commented Apr 22, 2022 •

edited

Loading

martindurant commented Apr 25, 2022

jpivarski commented Apr 25, 2022

jpivarski commented Apr 26, 2022

martindurant commented Apr 26, 2022

jpivarski commented Apr 28, 2022

Implementing ak._v2.to_parquet. #1440

Implementing ak._v2.to_parquet. #1440

Conversation

jpivarski commented Apr 22, 2022

codecov bot commented Apr 22, 2022 • edited Loading

Codecov Report

martindurant commented Apr 25, 2022

jpivarski commented Apr 25, 2022

jpivarski commented Apr 26, 2022

martindurant commented Apr 26, 2022

jpivarski commented Apr 28, 2022

codecov bot commented Apr 22, 2022 •

edited

Loading